Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix/review-issue-2 #4

Merged
merged 9 commits into from
Jan 16, 2025
Merged

fix/review-issue-2 #4

merged 9 commits into from
Jan 16, 2025

Conversation

MQ37
Copy link
Owner

@MQ37 MQ37 commented Jan 13, 2025

PR for #2

Changes

Code Changes

  • Renamed the input parameter from url to startURL for consistency with Website Content Crawler.
  • Removed the default value in actor_input = await Actor.get_input() or {'url': 'https://docs.apify.com/'} to ensure the Actor fails if no value is provided.
  • Added logs to improve the UX when Website Content Crawler is running.
  • Used the walrus operator in record = await kvstore.get_record(store_id) if record is None to improve code readability.
  • Fixed typo in the comment "# changed by get_crawler_actor_config with defailt value 1".
  • Updated CRAWLER_CONFIG to only set values that differ from the default.
  • Added an example to the docstring for the render(data: dict) -> str function to help users understand its usage.
  • Added logging for store = await Actor.open_key_value_store() and await store.set_value('llms.txt', output) to enhance UX.
  • Removed the TODO comment: # TODO: use path or LLM suggestions to group pages into sections # noqa: TD003 and recommended creating an issue for future improvements.

README Changes

  • Mentioned the use of Website Content Crawler, explained its purpose, and provided a link to the Actor.
  • Ensured consistent formatting for all list items in the README.
  • Updated list bullet point A simple, AI-focused structure to help coders, researchers, and AI models easily access and use website content. to include emphasis for consistency.
  • Replaced all caps in titles like Features of llms.txt Generator or Content Extraction to use proper capitalization.

Open Points

  • Clarify why get_description_from_html is used: Sometimes the Website Content Crawler does not return meta descriptions in the dataset, even when available in the HTML. So, I extract the descriptions myself for now, but I can try to fix this issue in the crawler.
  • Add a memory limit for the crawler Actor to ensure it works with the free tier: A 4 GB crawler memory limit is already hardcoded, and this has been mentioned in the README.
  • Consider replacing the dummy example of llms.txt with a real one: Added a real output example of llms.txt generated by the Actor for docs.apify.com and kept the proposed structure example on top.
  • Do not use all caps in the README.md: Replaced all titles and non-entity name words with lowercase letters instead of using title case. Is this correct? @jirispilka

MQ37 added 5 commits January 13, 2025 20:19
…d, mention memory limit in readme, readme mention website content crawler, crawler config only non defaults, added logging and minor code improvements
@MQ37 MQ37 self-assigned this Jan 13, 2025
Copy link
Collaborator

@jirispilka jirispilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm sorry, a couple more comments as previously, I only did a quick pass.

main.py

  • In the comments, avoid general references to Apify Actors. Be specific to this actor. For example: "Main entry point for this Apify Actor."
  • Do not raise RuntimeError; use Apify.fail instead.
  • Ensure you handle the case when the dataset is empty.

helpers.py

  • I'm not a fan of one-line functions, as they make the code harder to read. For example:
    def render_llms_txt(data: dict) -> str:
        """Renders the `llms.txt` file using the provided data."""
        return render(data)

renderer.py

  • Rather than concatenating strings directly, add items to an array and then use join to concatenate. It is more efficient.

/tests

  • I generally prefer pytest over unittest. pytest has a cleaner syntax, supports fixtures, and, for me, it is easier to work with. Also, it used in python-crawlee, which we (at least myself) considered as Apify standard.

Copy link
Collaborator

@janbuchar janbuchar left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with the points made by @jirispilka, and I added some nits of my own. But overall, it looks solid!

"type": "string",
"description": "The URL of website you want to get the llm.txt generated for.",
"description": "The URL from which the crawler will start to generate the llms.txt file.",
"editor": "textfield",
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use the requestListSources editor in this case - see https://docs.apify.com/platform/actors/development/actor-definition/input-schema/specification/v1#array. You may use apify.RequestList to process it afterwards.

Unless you really want to support just one URL every time, of course.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Because of how the format of llms.txt is specified I think the single url input is more suitable - we treat it like an index for the whole site (or sub-site) so the url should act as a root (or sub-root).

@@ -13,6 +13,8 @@
if TYPE_CHECKING:
from apify_client.clients import KeyValueStoreClientAsync

logger = logging.getLogger('apify')
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can just use Actor.log instead.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using Actor.log in main.py now, but in helpers.py pytest prints warnings because of non existent event loop, so I will keep the logger there.

src/main.py Show resolved Hide resolved
src/renderer.py Outdated Show resolved Hide resolved
…s, render using join, switch to pytest, actor.log in main
@MQ37
Copy link
Owner Author

MQ37 commented Jan 15, 2025

I'm sorry, a couple more comments as previously, I only did a quick pass.

main.py

* In the comments, avoid general references to Apify Actors. Be specific to this actor. For example: "Main entry point for this Apify Actor."

* Do not raise `RuntimeError`; use `Apify.fail` instead.

* Ensure you handle the case when the dataset is empty.

helpers.py

* I'm not a fan of one-line functions, as they make the code harder to read. For example:
  ```python
  def render_llms_txt(data: dict) -> str:
      """Renders the `llms.txt` file using the provided data."""
      return render(data)
  ```

renderer.py

* Rather than concatenating strings directly, add items to an array and then use `join` to concatenate. It is more efficient.

/tests

* I generally prefer `pytest` over `unittest`. `pytest` has a cleaner syntax, supports fixtures, and, for me, it is easier to work with. Also, it used in `python-crawlee`, which we (at least myself) considered as Apify standard.
  • refactored the comments
  • Apify.fail is called by the context manager automatically on raise, which I am using
  • added check for empty dataset from WCC call - actor will fail
  • removed the render_llms_txt from helpers.py
  • using string join in renderer now
  • switched over to pytest

@MQ37 MQ37 requested review from jirispilka and janbuchar January 15, 2025 13:37
README.md Outdated

The **llms.txt Generator Actor** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA.
The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
The **llms.txt generator** is an Apify tool that helps you extract essential website content and generate an **llms.txt** file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This tool leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.
The **llms.txt generator** is Apify Actor that helps you extract essential website content and generate an [llms.txt](https://llmstxt.org/) file, making your content ready for AI-powered applications such as fine-tuning, indexing, and integrating large language models (LLMs) like GPT-4, ChatGPT, or LLaMA. This Actor leverages the [Website Content Crawler](https://apify.com/apify/website-content-crawler) Actor to perform deep crawls and extract text content from web pages, ensuring comprehensive data collection. The Website Content Crawler is particularly useful because it supports output in multiple formats, including markdown, which is used by the **llms.txt**.

Copy link
Owner Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

capitalized 'Actor' and added llms.txt org link to top description

Copy link
Collaborator

@jirispilka jirispilka left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nit: tool -> Actor

@MQ37
Copy link
Owner Author

MQ37 commented Jan 16, 2025

One nit: tool -> Actor

changed tool -> Actor in README

@jirispilka
Copy link
Collaborator

Thanks for the changes!

@MQ37 MQ37 merged commit c274710 into master Jan 16, 2025
1 check passed
@MQ37 MQ37 deleted the fix/review-issue-2 branch January 16, 2025 10:55
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants